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Abstract 



Pre-calculated libraries of molecular fragment configurations have pre- 
viously been used as a basis for both equilibrium sampling (via library- 
based Monte Carlo) and for obtaining absolute free energies using a polymer- 
growth formalism. Here, we combine the two approaches to extend the 
size of systems for which free energies can be calculated. We study a 
series of all-atom poly-alanine systems in a simple dielectric solvent and 
find that precise free energies can be obtained rapidly. For instance, for 12 
residues, less than an hour of single-processor is required. The combined 
approach is formally equivalent to the annealed importance sampling al- 
gorithm; instead of annealing by decreasing temperature, however, inter- 
actions among fragments are gradually added as the molecule is grown. 
We discuss implications for future binding affinity calculations in which a 
ligand is grown into a binding site. 



Contents 

1 Introduction [3 

2 Methods S 

2.1 Polymer growth with relaxation: annealed importance sampling . 

2.2 Choice of stages: Progressive addition of interactions 

2.3 Free energy calculation in annealed importance sampling 



1 



2.4 Library-based Monte Carlo for relaxation [sj 

2.5 System and simulation details 

2.6 Validation method [13 

3 Validation and Results [l3 

3.1 Sampling Validation [13 



3.2 Free Energy Measurements and Statistics [11 



4 Discussion 12 



4.1 Sampling quality and free energy precision [12 



4.2 Limitations of the relaxed growth method [12 



4.3 Possible improvements [13 



4.4 Future applications to protein affinity estimates [13 



4.5 Improvements in implementation compared to previous work . . [14 



5 Summary and Conclusions [14 



6 Acknowledgements [15 



2 



1 Introduction 



Free energy differences, A.F, are fundamental to physical chemistry In the 
context of biomacromolecules, AF values can quantify folding stability, relative 
populations, and binding affinity T. Although computer simulations have been 
used to estimate AF values for biomolecules, success has been hampered by 
well-appreciated sampling problems [U [2] . 

A large number of numerical techniques have been used to calculate molecular 
free energy differences, but a smaller subset is capable of estimating absolute 
free energies values 

F = -keT InZ (1) 

where Z is the dimensionless configurational partition function. The most 
straightforward way to estimate an absolute free energy is using a reference 
system with an exactly calculable free energy (Fref), so that F — Frof + AF. 
AF is then obtained using a standard free energy difference technique, yield- 
ing the absolute free energy of the system. This long-established strategy (e.g., 
[3]) was first suggested for molecular systems by Stoessel and Nowak using a 
harmonic reference system [3]. Other strategies for calculating absolute free 
energies are also possible, as demonstrated by the work of Gilson and coworkers 
[D [i H [3 [HI 12] , as well as by Meirovitch and coworkers [TUl HH [H [H [M [13 [S] 
and Brooks and coworkers [T71 [TSl [TH] ■ 

The present study builds on earlier work in our group using reference systems to 
calculate absolute free energies for molecular fragments which subsequently are 
combined into a full molecule [301 [H] • This polymer-growth strategy employs 
a reference system of non-interacting fragments (e.g. amino acids), which are 
combined to yield a AF correction accounting for all interactions in the full 
molecule. Our earlier studied yielded the absolute free energy for tetra-alanine 
(Ace-Ala4-Nme), but could not easily be applied to significantly larger systems 
pn] . Clark et al. developed a closely related fragment-based approach for es- 
timating binding affinities without however, accounting for fragment flexibility 

m 

Here we employ a rigorous "annealing" strategy [23l [24l [25l [26] which inte- 
grates the polymer-growth approach with our recently developed library-based 
Monte Carlo (LMBC) method [17|[5S]. We anneal not by lowering tempera- 
ture, but by adding interactions between previously non-interacting fragments. 
The addition of interactions is formally equivalent to "growing" the polymer 
[20l[27]. Interactions among fragments are added gradually over several stages, 
permitting the calculation of incremental free energy differences until the full 
molecule and final free energy is produced. A weighted configurational ensem- 
ble is generated at each stage, which is then "relaxed" by canonical simulation 
based on the stage-specific interactions. This alternation of adding interactions 
and relaxing is formally equivalent to "annealed importance sampling" (AIS) 



3 



P51 [231 [551 [21] • We also employ resampling as a variance-reduction technique, 
following our previous temperature-based annealing |26) . 

The annealing strategy is enhanced by our use of LBMC for the relaxation 
phases. Like other free energy methods, annealed importance sampling employs 
canonical sampling (here termed "relaxation") at each stage - which can be 
performed by any algorithm that correctly samples the stage-specific distribu- 
tions. In the present study, LBMC is a natural choice because it is based on 
fragments and was shown to be highly efficient for sampling flexible peptides 
in implicit solvent [55] • The addition of arbitrary interactions for staging is 
also straightforward in LBMC. Nevertheless, the sampling method is "orthog- 
onal" to the free energy calculation, and other canonical sampling algorithms 
(e.g. [11 [301 [311 [33 [p, '34', "as", [Ml [H] ) possibly combined with hardware-based 
improvements [371 [3H1 39 , 40 , 4T] could be used instead of LBMC. 

Aside from the issue of calculating absolute free energies, the annealing approach 
for obtaining AF values can be set in the context of other methods which 
directly couple equilibrium sampling and free energy estimation [HI [171 ISH [Ml 
[251 [261 [H [in [g [451 [46]. That is, once the "staging" of the AF calculation 
is established -adding interactions in our case, and incrementing a parameter 
A in many others [3] - equilibrium sampling at each stage can be performed 
independently or in a coupled way. Thermodynamic integration [5] performs 
independent simulations for each stage, but in many more recent approaches, 
ensembles at a given stage are used to aid sampling at other stages [18l [171 
[23l[24l[25l[26l[42l[43l[44l[4g with "lambda dynamics" being a good example 
|47l [T9l [TtI [T8] . The annealing strategy couples sampling to staging in a uni- 
directional way, presumably starting from the stage which is easiest to sample. 
In our case, it is trivial to generate fully independent configurations for the 
reference stage of non-interacting fragments; interactions are then "annealed 
in." 

The annealing approach combining library-based polymer-growth and LBMC 
yields good results in this initial study, significantly extending our previous 
growth work, which was limited to peptides of 5 residues [20]. Using this 
method, we can compute precise free energies for Ace-Alai2-Nme in several 
hours and for peptides up to 22 residues in about two weeks of computing on 
a single 3.0 GHz processor core. Free energies and equilibrium ensembles of 
smaller peptides can be obtained in seconds or minutes depending on the de- 
sired precision. We validate our results by comparing the equilibrium ensembles 
produced during annealing to independent Langevin simulations with a collec- 
tive simulation time of 1 fis. 
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2 Methods 



We describe how our earlier fragment-based growth method |20)can be improved 
using library-based Monte Carlo [371 [53]. Formally, however, our procedure is 
not new, but is a special case of "annealed importance sampling" (AIS) [531 [3S]. 
We therefore describe our procedure in terms of AIS, which indeed provides a 
very natural formal framework. In our approach however, instead of lowering 
temperature as in AIS (i.e. annealing in temperature space), we incrementally 
add interactions among molecular fragments. Said another way, interactions 
between fragments are incrementally "annealed in" - i.e. simply turned on - 
between successive growth stages. 

2.1 Polymer growth with relaxation: annealed importance 
sampling 

The formalism introduced by Neal as AIS [23] will be applied to generalize 
standard polymer growth algorithms; see also [24]. It can be described in a 
straightforward way based on an arbitrary set of un-normalized distributions 
Tri{x), with i — 1,2, . . . , N representing the index of the (growth) stage. In our 
case, these distributions are standard Boltzmann factors: 

^,(x) a e-^'(^)/"-^' (2) 

where Ui is the potential energy at growth stage i at temperature T,;. In our 
case, annealing is performed only in interaction space so that Ti = 29SK for 
all i. Following the previous convention in our growth study j20| . the initial 
distribution is tti, and tt^t is the targeted distribution. In physical terms, tti 
will represent the distribution of all atoms in non- interacting fragments and ttjv 
will be the fully interacting molecule. Full details of the stages are given below 
in sec. 12.21 

Our AIS procedure has only a few simple steps and follows our earlier work 
[26] . The process starts with a well-sampled ensemble of M configurations at 
stage i = 1 in the initial distribution tti. AIS does not specify a procedure 
for sampling tti, but assumes it can be accomplished. In our case, the non- 
interacting fragment ensemble can be sampled almost perfectly using internal 
coordinate Monte Carlo, as described in refs [20 l [271128] . 

The ensemble progresses to the next growth stage by "annealing in" interac- 
tions - i.e. "turning on" interactions - between fragments according to the 
growth pathway shown in Fig. 1 and equations [5] The annealing process shown 
schematically in Fig. 1 corresponds to the case where we are growing a target 
molecule composed of smaller non-overlapping fragments x = {xa,xb, ...,xy). 

Formally, the M configurations from the current stage i are resampled into the 
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next distribution tt^+i based on the weights: 

w{x) = 7ri+i(x)/7ri(x) 



(3) 



There are numerous procedures for resampling '48| , but here we use the simplest 
approach of generating M new configurations for ensemble i + 1 proportional 
to the weights from Eq. [3l This approach leads to some higher-weight config- 
urations being duplicated - a fact which is exploited in AIS. After the simple 
resampling procedure, all weights become equal to one. 

Although the resampled set of M configurations for stage i + 1 is a statistically 
valid ensemble for Tr^+i, it has suffered some "attrition" in quality after growth. 
Specifically, the uncertainty in calculated observables will be larger than if we 
had M truly independent configurations - the non-independence is explicit in 
the duplicated configurations. In AIS, one therefore performs some "relaxation" 
simulation on each configuration in the ensemble. This can be done using any 
canonical sampling algorithm, thus preserving the tt^+i ensemble but improving 
the statistical quality. Our library-based procedure for canonical sampling is 
described in sec. 12.41 The degree of improvement in ensemble quality depends 
on the amount of relaxation, a point which we return to later. Nevertheless, after 
relaxation, a valid ensemble of M configurations in the Tr^+i ensemble remains. 
Reweighting and relaxation are repeated through the growth stages until the 
targeted distribution, ttjv, has been sampled. 

We summarize our AIS procedure as follows: 

(i) Generate an initial distribution of the wi ensemble for stage i = 1. This 
is performed by drawing a random configuration from the pre-calculated 
library for each fragment: see sec. 12.41 

(ii) Resample to the next stage, i + 1, by "annealing in" interactions via the 
weight w(x) = 7ri+i(x)/7ri(x). Our stages are specified in sec. 12.21 

(iii) Relax each configuration via any canonical sampling algorithm, e.g. LBMC, 
which maintains the tt^+i distribution. 

(iv) Repeat steps (ii) and (iii) until the target distribution ttn{'x) has been 
reached. 

2.2 Choice of stages: Progressive addition of interactions 

To establish notation, we first divide the full set of coordinates x into N non- 
overlapping fragments 

xa,xb,xc,—,xy (4) 

The total energy, U (x), of any fragment-based configuration can be decomposed 
into two parts. The first contribution is a sum over the energies internal to each 
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fragment (see Ui below); the second is a sum over energies between interacting 
fragment pairs (see Un below). 

For a target molecule consisting of N fragments, we employ N intermediate 
models (stages) such that interactions between fragments are gradually turned 
on along the growth pathway shown in Fig. 1. The first stage, i.e. the refer- 
ence state Ui corresponding to the distribution tti, is sampled at the library 
generation stage and only includes interactions internal to each fragment. Sub- 
sequent intermediate stages "anneal in" the indicated interactions among frag- 
ments A,B,C . . .Y. The energies of the intermediate models can be written 
recursively as: 

C/i(x) = Ua{xa) + Ub{xb) + Uc{xc) + --- + Uy{xy) 

C/2(x) = Ui{x.) + Uab{xa,xb) 

Usix) = U2{yi) + Uac{xa,xc) + Ubc{xb,xc) 

UN{Ti) = ?7jV-l(x) + UaY{Xa,XY) 

(5) 

The energy of the last stage, Un{x), is the full energy of the desired target 
molecule. The sum extends over interactions between the last fragment Y, with 
all previous fragments in the molecule. 

2.3 Free energy calculation in annealed importance sam- 
pling 

The free energy of the fully interacting target ensemble relative to the reference 
state can be expressed in terms of free energy differences between neighboring 
levels of the annealing ladder: 

Fat - Fi = (^2 - Fi) + (Fa - F2) + . . . + {Fn - Fn-i) (6) 

or 

AFi^AT = AFi,2 + AF2,3 + . . . + AFn-i,n (7) 

Possession of the ensembles at each level of the annealing ladder directly per- 
mits the calculation of free energy differences between levels. If the configura- 
tion space is progressively narrowed through the stages and the temperature is 
constant, as in our case 20 , the required values can be obtained simply using 

exp(-/3AF,,,+i) = (exp(-/3At/,,,+i)), (8) 
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where /SXJi^iJ^i — Ui+i — Ui and the ensemble average is over the configurations 
from stage i. 

Ahhough this relation is sufficient for our studies, in more difficult cases, "two- 
sided" calculations could be performed - e.g., using the Bennett method [H]. 

We also point out from Eq. [7] that if the absolute free energy of stage 1, Fi, is 
known then the absolute free energy of stage N can be simply found via 

Fn^Fi + AFi^N (9) 

Absolute free energies of the molecular fragments have already been determined 
in our previous work [20j , therefore it is straightforward to convert all free energy 
differences reported in this paper into their absolute values. 

2.4 Library-based Monte Carlo for relaxation 

AIS requires a canonical sampling procedure for the "relaxation" process, and 
we employ library-based Monte Carlo (LBMC)[171 [SO]. LBMC is a natural 
choice because it can employ the same fragments used in the staging choices, 
and is also highly efficient for sampling implicitly solvated peptides ^28^ . LBMC 
uses pre-generated libraries of fragment configurations, echoing extensive work 
with the Rosetta folding program 51 . LBMC is a canonical sampling procedure 
which can be used with an arbitrary forcefield and solvent model. Full details 
regarding LBMC have been given in previous work |27[l50[f25] but we summarize 
the essentials here. 

In simplest terms, LBMC is an ordinary MC procedure which can employ a 
special fragment-swap trial move: exchange of the configuration of a fragment 
with a pre-calculated configuration chosen from a "library" or ensemble of pre- 
calculated configurations. When the library is distributed according to the 
Boltzmann factor of the target forcefield for all interactions internal to the 
fragment, the Metropolis criterion[52] is particularly simple: 

Pace = min(l, exp[-(3AU'''']) (10) 

where AU'^^^^ is the change in fragment- fragment interaction energy due to the 
trial fragment swap. 

More precisely, if one is performing a trial swap move by changing a single 
fragment configuration xj — > xji , then 

AC/™''' = [UixA,...,xj,...,XY)-UixA,...,xj,,...,XY)] - [Uj>ixj,)-Ujixj)] 

(11) 

The first two energy terms are calculated per trial move. The second two energy 
terms are simply the energies of the single fragment configurations J and J' - 
they are extracted from the pre-calculated libraries. 
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Many variants of LBMC are possible, but this simple scheme has shown to be 
successful for flexible all-atom peptides [28]. In particular, all degrees of freedom 
are included in the libraries - so an amino acid fragment consists of all atomic 
coordinates plus the six connector degrees of freedom which exactly specifiy the 
position and orientation of the next fragment. Swap moves are attempted on 
configurations drawn uniformly from the Boltzmann distributed libraries. 

Libraries for Ace, Ala and Nme are generated by internal-coordinate Monte 
Carlo, as described in our earlier work [27]. Libraries are distributed according 
to the T — 298 K Boltzmann factor of all OPLSAA energy terms internal to 
each fragment - both bonded and non-bonded. The libraries include additional 
dummy atoms which encode the six degrees of freedom necessary for positioning 
a fragment with respect to the previous fragment [27]. The fragments Ace, Ala 
and Nme contain 6, 10, and 6 atoms respectively. Each fragment library con- 
tained 10^ such distinct configurations and their corresponding energies. Col- 
lectively, these libraries occupy approximately 300MB of computer memory. 
Although smaller libraries probably can be effective, we are still investigating 
optimal sizes. 

For the tti distribution of non-interacting fragments, LBMC is not necessary. 
Rather, the distribution is sampled by drawing a random configuration from 
the pre-calculated library for each fragment: see Ui in Eq. 2. 

2.5 System and simulation details 

We use library-based AIS to calculate free energy changes along the growth 
pathway in Fig. 1 for four polypeptide systems: Ace-Ala4-Nme, Ace-Alai2- 
Nme, Ace-Alai6-Nme Ace-Ala2o-Nme. For all systems under investigation, our 
libraries implemented the OPLS-AA forcefield [53^ with uniform and constant 
dielectric at constant temperature 298/^. The uniform dielectric constant was 
chosen to be e = 60 and no potential cutoffs are used in the calculation of the 
energy terms. While other implicit and explicit solvent models are within scope 
of library-based methods, the goal of this work was to extend library-based free 
energy calculations [3^ and demonstrate that complete sampling and accurate 
free energy measurements are easily attainable for larger systems. 

During each AIS simulation, the free energy change between growth stages i 
and i -|- 1 is measured according to equation (8) using configurations obtained 
throughout the relaxation procedure. For the purpose of examining our data, we 
also calculate intermediate Ai'i^i+i values as relaxation proceeds; these values 
are obtained using Eq. (8) for the set of M partially relaxed configurations at 
various "time" points. The final free energy difference between growth stages 
i and i -I- 1 is then determined by exponentially averaging all intermediate IS.F 
values. The total free energy difference between the target and reference system 
is calculated by summing the exponentially averaged free energy differences for 
each growth stage, i.e. via equation (7). 
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For the systems under investigation, we repeat simulations with various amount 
of relaxation and ensemble sizes to observe the effects on sampling quality and 
AF fluctuations. In principle, these parameters may be adjusted automati- 
cally until a desired threshold for free energy accuracy is achieved, however this 
automation is not implemented in our current work; see sec. 14.31 The total 
number of relaxation steps is between 10'' — 10^ LBMC trial moves for each sim- 
ulation and the steps are distributed evenly over each growth stage. Note that 
relaxation in the early stages of growth is considerably faster than later stages 
because there are fewer terms to calculate in Eq.(lO). Specifically, to grow a 
single polyalanine chain Ace-Ala„-Nme containing n alanine residues requires 
3.6 * 10"+^ energy evaluations. For each simulation, at least 10 repeats have 
been performed in order to obtain accurate statistics on variations in sampling 
quality and free energies A_Fijv. 

2.6 Validation method 

To validate proper sampling of our systems, we compare the target ensemble 
of configurations with those obtained from ten independent Langevin Dynamics 
trajectories. Such a comparison is possible by choosing a strongly discriminating 
representation of phase space based on a Voronoi construction described below. 
The ten independent Langevin simulations were performed using TINKER for 
a collective run time of 1/xs. All Langevin simulations were run at T = 298K 
with a friction constant of 5 psec~^. 

We compare the target systems' phase-space distributions with those obtained 
from well-sampled Langevin Dynamics simulations. Briefly, to obtain a rep- 
resentation of the phase space distribution, we choose flve independent and 
dissimilar reference structures (similarity metric is based on RMSD) from the 
Langevin Dynamics trajectory of the target system. Configuration space is then 
partitioned into 5 distinct regions or "bins" based on a Voronoi procedure so 
that each bin contains all configurations closest to one reference structure. This 
representation of the phase-space distribution provides an extremely sensitive 
test which is not always "passed" as seen in the next section. Full details for 
this procedure can be found in ref. [531 [551 IMl HO] • 

3 Validation and Results 
3.1 Sampling Validation 

First, to check whether sufficient sampling has been performed. Fig. 2 plots 
the configuration-space distributions mentioned in sec. 12.61 for the four systems 
examined. In each plot, we compare the distributions resulting library-based 
AIS and Langevin Dynamics simulations. Error bars have a width of two stan- 
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dard deviations and are based on results obtained from at least 10 independent 
simulations of each method. In all cases, there is good agreement between both 
methods, validating the free energy measurements and sampling capacity of this 
method. Note that although the sampling error bars are large for the Ace-Ala2o- 
Nme system, the free energy standard deviation is still reasonably small(0.39 
kcal/mol) as described below. 

For reference, in Fig. 3 we plot the configuration-space distributions for the 
peptide Ace-Alai2-Nme as obtained from three methods: pure LBMC of the full 
system (no staging), pure growth as in ref. |20| (no relaxation), and Langevin 
dynamics simulations. The data underscores the fact that pure library-based 
growth simulations are unable to sample these larger systems. However, by 
implementing relaxation combined with growth (i.e. AIS), we are able to recover 
the correct equilibrium distribution - compare Fig. 3 and Fig. 2b. 

3.2 Free Energy Measurements and Statistics 

To assess the free energy estimates, we report the mean and standard deviation 
of the free energy for each polypeptide relative to the non-interacting reference 
state for varying amounts of relaxation and ensemble sizes as shown in Table 1. 
The statistics are based on at least 10 independent simulations for each set of 
parameters. The table indicates the computing time required for different levels 
of precision, although further optimization may be possible fsec. 14. 3| ). 

The principal result embodied in Table 1 is that for polyalanine systems up 
to 16 residues, only a couple of hours is required to reach a level of precision 
comparable to the accuracy of forcefields (~0.5 kcal/mol ) 1^. It can be seen 
from Table 1 that free energy variances can be decreased by increasing the 
overall amount of relaxation. Importantly, however, increasing the ensemble 
size implies that each configuration will receive less relaxation if the simulations 
are to be run in equal amounts of time. Although the sampling error bars are 
large for the Ace-Ala2o-Nme system (e.g Fig. 2(d) ), the free energy estimate is 
still reasonably precise, with a standard deviation of 0.39 kcal/mol. 

The series of constant-time simulations for Ace-Alai2-Nme in Table 1 indicates 
that decreasing the ensemble size by a factor is 10 is roughly equivalent to 
increasing the number of relaxation steps by the same factor insofar as free 
energy variances are concerned. 

A representative plot of intermediate AF measurements during the relaxation 
of each growth stage for the Ace-Alai2-Nme system is shown in Fig. 4 along with 
the exponentially averaged results shown in the inset for each growth stage. In 
Fig. 4, the sharp decrease in AFi^i+i at growth stage i = 6 is attributed to 
the fact that, on average, configurations in the ensemble become long enough 
so that more contacts are formed , e.g. H-bonds and steric clashes. The LBMC 
acceptance rate (not shown) also follows a similar trend since there is a larger 
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chance of steric overlap when more atoms are interacting. 

We investigated the issue of the choice of ensemble size (M) for a given comput- 
ing investment. Thus, for Ace-Alai2-Nme, four separate runs were performed, 
each using a total of 10® relaxation steps as shown in Table 1. The standard 
deviations, ct, suggest M ^ lO'^ is close to optimal for this system. 

4 Discussion 

4.1 Sampling quality and free energy precision 

What kind of sampling quality is required for reasonable precision (0.5 kcal/mol) 
in free energy estimates? To address this issue, it is interesting to compare re- 
sults obtained in cases of good and poor sampling. We have measured the 
free energy Ai^i^^r using pure growth simulations (no relaxation) for the four 
polypeptide systems examined, see Table 1. Pure growth simulations, as men- 
tioned previously, will under-sample configuration space for these larger sys- 
tems - see Fig. 3 for example. In the case of the smallest system we examined, 
Ace-Ala4-Nme, relaxation makes very little difference in sampling quality since 
pure growth simulations are able to accurately predict the correct equilibrium 
distributions and free energies [20]. However, for the larger systems Ace-Alai2- 
Nme and Ace-Alaig-Nme, the difference in mean AFi^n values as obtained from 
poorly-sampled (e.g.. Fig. 3) and well-sampled simulations(e.g. Fig. 2b) are 1.18 
and 1.62 kcal/mol respectively. This underscores the fact that seemingly small 
differences in free energies could mask poorly-sampled ensembles. Such differ- 
ences would be expected to increase in more complex systems. 

The coupling of sampling and free energy calculation suggests one can examine 
such approaches purely in terms of their ability to provide equilibrium sampling. 
To set this in context, note that we recently showed that a fragment-based 
polymer growth strategy without annealing/relaxation could sample equilibrium 
ensembles of all-atom peptides but was limited to about five residues [20]. The 
present study, by adding a relaxation phase, greatly extends the size of peptides 
which can be sampled. 

4.2 Limitations of the relaxed growth method 

The essential limitation of our approach, not surprisingly, is sampling. It is 
not a coincidence that our present implementation becomes dramatically more 
expensive when studying systems beyond the efficient range of current LBMC 
simulation, roughly 16-20 residues for polyalanine. Other sampling methods, or 
improved versions of LBMC, likely could extend the system sizes amenable for 
free energy estimation. The basic requirements, as is implicit in Eq. (8) and the 
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subsequent discussion, is to be able to sample the ensembles at every stage to 
a sufficient degree to permit calculation of the required free energies. Improved 
staging, possibly more incremental, could also be useful for larger systems. 

4.3 Possible improvements 

Insofar as annealed importance sampling is a formal method based on (i) arbi- 
trary stages and (ii) an arbitrary (correct) sampling method, improvements in 
either of those components will improve free energy calculations. Sampling itself 
could be improved in a number of ways: with LBMC variants better optimized 
for structured (e.g., folded) systems; with alternative sampling algorithms; or 
with other hardware optimizations, such as based on graphics processing units 
(GPUs), for LBMC or for more traditional algorithms [37] ■ The relaxation of M 
configurations at every stage may be amenable to CPU or GPU parallelization. 

Staging could also be improved, as is generally the case in free energy cal- 
culations [571 158] . Our current procedure adds all interactions for a newly 
added fragment in a single stage - but this becomes a significant perturbation 
as system size increases. The interaction-space staging scheme naturally per- 
mits "sub-staging" of added interactions, and this strategy will be explored in 
future work. 

However, even for a fixed set of stages and a specified sampling algorithm, sig- 
nificant further optimization may be possible. For instance, as Fig. 4 illustrates 
graphically, some stages are much easier than others - i.e., have greater overlap. 
This suggests that the amount of relaxation per stage could be adjusted on the 
fly based on a convergence criterion, perhaps for a block-averaged variance j59j 
of free energy estimates. The size of the ensemble, M, could also be optimized in 
a system and stage-specific way. These issues will be taken up in future studies. 

4.4 Future applications to protein affinity estimates 

The resources expended in this study were quite modest, and suggest signifi- 
cantly larger systems could be addressed with more computing. Yet we do not 
believe that the present methodology would permit the growth and sampling of 
all-atom proteins. What are the prospects for estimating binding affinities? 

Importantly, the estimation of binding affinities would not require the growth 
of a full protein. Rather, the affinity a ligand for a receptor would be calculated 
from the difference of free energy values calculated by growing the ligand into 
solvent and into the receptor binding site. In other words, only the ligand 
needs to be grown, and typical ligands are smaller than the peptides studied 
in this report. Such a growth procedure adds interactions, and can be seen as 
the inverse of a decoupling procedure |60[ 161] . At the same time, free energy 
values require good sampling of the full system, which is never easy for proteins. 
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In this regard, LBMC was designed to handle hybrid models in a natural way 
- with an atomistic binding site and a reduced representation elsewhere [50] . 
Good sampling of such models via LBMC may permit rapid, statistically based 
affinity estimates within the context of hybrid models. Such a strategy echoes 
the work of Roux and coworkers [BH |M1 ISSI , but hybrid models would permit 
potentially important allosteric coupling to regions distant from the binding site. 
Another strategy based on fragments has been proposed [22^, but it does not 
account for flexibility internal to fragments. 

4.5 Improvements in implementation compared to previ- 
ous work 

By comparison with previous a previous study by our group (5IJI, the reported 
computing times in Table 1 may seem incongruously fast. Our earlier study ex- 
amined a series of peptides, with the largest being tetra-alanine, noting that 50 
minutes of computing time were required to obtain a precision of 0.29 kcal/mol 
for tetra-alanine. By contrast, in our present work, the same "pure growth" 
calculation (no relaxation) required 14 s. The improvement results primarily 
from the fact that our previous work was a scripting based post-analysis of 
data generated by LBMC code, whereas we now calculate free energy values di- 
rectly within the LBMC code. Additionally, the growth pathway implemented 
in this study differs slightly from that in ref. [20] . Our current growth pathway 
(Fig. 1) adds new fragments in a more efficient way, increasing overlap between 
neighboring growth stages. 

5 Summary and Conclusions 

We reported free energy calculations for implicitly solvated polyalanine peptides, 
ranging in size from four to 20 residues. The calculations combine two previ- 
ously developed techniques, a fragment-based polymer growth strategy [20] and 
library-based Monte Carlo simulation [27l [28] . Our new implementation greatly 
extends the system sizes amenable to free energy estimation compared to a pre- 
vious study by our group 20 . Because the calculations for the peptides are 
so inexpensive, we hope the approach can be useful for protein-ligand affinity 
estimation in the future, as described in sec. 14.41 The present calculations re- 
quired seconds to days of single-CPU computing, depending on system size and 
required precision. 

The results here are another application of the memory-intensive strategy of 
using pre-calculated libraries of molecular- fragment configurations [^11 [13 [2H] . 
That strategy has been useful for rapid sampling of semi-atomistic protein mod- 
els [27] and of implicitly solvated peptides [28] . Libraries of configurations have 
previously been applied extensively in the Rosetta protein folding software [5T] 
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albeit not for canonical sampling. The potential for ongoing improvements in 
memory size and access speed, orthogonal to CPU speed, suggests the value of 
continued pursuit of memory-intensive computations. 

From a formal point of view, we have shown that one can perform annealing in 
interaction space. That is, our approach is formally equivalent to the annealed 
importance sampling strategy described by Neal [23], except that instead of 
lowering temperature, we add interactions among molecular fragments. 
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Stage 1: (Reference state) non-interacting 
fragments 



Growth Pathway: 
Interactions between different 
fragments are turned on 
gradually 



Stage N: (Target state) fully-interacting 
fragments 
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Figure 1: A general schematic of the growth and relaxation simulation proce- 
dure. Each box represents an ensemble of fragments. The process starts with 
non-interacting fragments (e.g. the amino acids) which are then used to "grow" 
the full molecule. At each intermediate stage, the next fragment in the sequence 
is added by "turning on" the additional interactions due to that fragment. Free 
energy changes, Ai^, are computed and accumulated at each growth stage. This 
procedure is repeated until the fully interacting molecule is obtained at stage N . 
The grey colored blocks represent ensembles which have been relaxed according 
to library-based Monte Carlo. 
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Figure 2: Comparison of the equilibrium distributions obtained using both AIS 
and Langevin Dynamics simulations for the four systems under investigation. 
Correct equilibrium sampling has is confirmed for systems a), b) and c) using the 
new method. System d) has not reached the precision of the 100 usee Langevin 
simulations. Error bars for both of these methods have a width of two standard 
deviations and are a result of statistics obtained from at least 10 independent 
simulations. 
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Figure 3: Comparison of equilibrium distributions obtained by growth, LBMC 
and standard Langevin Dynamics simulations. The plot shows the fractional 
population in different "bins" of phase space for the peptide Ace-Alai2-Nme. 
The error bars have a width of 2 standard deviations and are based on data 
from 10 independent simulations. The lines drawn between data points are a 
guide to the eye. For a system of this size, a simple growth procedure without 
relaxation fails to produce the correct distribution. 
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Figure 4: Intermediate free energy changes measured during the relaxation pro- 
cess for each growth stage for Ace-Alai2-Nme. Each data point represents Eq. 
(8) appUed to a subset of the data - different colors are displayed to more clearly 
distinguish each growth stage. Shown in the inset is the exponentially averaged 
final free energy difference that is obtained from averaging all data via. Eq. (8) 
at each growth stage. Lines between data points are drawn to guide the eye. 
These are representative results from a single simulation. 
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Table 1: Free energy statistics. The mean values (/x) and standard deviations 
(ct) in kcal/mol for the four peptide systems are tabulated as a function of 
ensemble size and amount of relaxation. Statistics are based on at least ten 
independent simulations. 



System 


Ensemble Size (M) 


Tot no. Steps 


/j(AFi,jv) 


<T(AFl,Ar) 


CPU time 


Acc-Ala4-Nmc 




no relaxation/pure growth 


-5.581 


0.043 


14 sec. 




10^ 


10^ 


-5.562 


0.001 


7.8 min. 


Ace-Alai2-Nme 


10* 


no relaxation/pure growth 


-20.19 


0.985 


20 sec. 




lOi 


10^ 


-21.22 


0.299 


1 hr. 




lOi 


10* 


-21.21 


0.235 


8.6 hrs. 




10^ 


10* 


-21.38 


0.223 


8.6 hrs. 




10* 


10* 


-21.24 


0.169 


8.6 hrs. 




10^ 


10* 


-21.28 


0.191 


8.6 hrs. 




10^ 


10^ 


-21.37 


0.057 


3.5 days 


Ace-Alai6-Nme 


lO'' 


no relaxation/pure growth 


-29.78 


1.191 


35 sec. 




lOi 


10^ 


-30.78 


0.459 


1.9 hrs. 




10^ 


10* 


-31.02 


0.396 


16.6 hrs. 




10" 


10'' 


-31.40 


0.179 


7 days 


Ace-Ala2o-Nme 


10^ 


no relaxatiou/purc growth 


-38.99 


2.101 


13 sec. 




lOi 


10* 


-41.66 


0.870 


1.3 days 




10* 


10^ 


-42.45 


0.394 


14 days 
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